My data analysis workflow depends on R. I tend to use old Matlab code, run in Octave or via oct2py, or new Python code for data wrangling. I have moved to matplotlib and seaborn for all graphics. I still depend on R for basic stats, multivariate analyses, and machine learning. There is so much in the R universe and, with the easy-to-use rpy2 library, there is no reason not to use R.

%R magic is provided by rpy2, and it works really well for interactive data analysis or one-off calls to specialized libraries from CRAN. However, for more intensive analyses of data from multiple experiments, I found issues with memory management in rpy2 (documented here). My laptop PC does not have enough RAM (8 GB) for me to run through a batch of LFP files from, say, a dozen experiments (e.g., 12 files x 16 channels = 192 channels, each typically with more than a million samples). It took a bit of work to figure out how to release and clean memory between channels or files. I document my solutions to this issue in the post below.

A Jupyter notebook for this post is available here.


In [1]:
import numpy as np, pandas as pd, feather
from scipy.io import loadmat, savemat

(I have found that the rpy2 extension only works on my PCs [Linux Mint 17 and Anaconda for Python 3.5] if I import the readline library before the rpy2 extension.)


In [2]:
import readline
%load_ext rpy2.ipython

the memory_profiler library

I used the memory_profiler library to assess memory usage in this notebook; memory_usage reports the current process's footprint in MiB.


In [3]:
# https://pypi.python.org/pypi/memory_profiler
from memory_profiler import memory_usage
mem_usage = memory_usage(-1, interval=1, timeout=1)  # -1 = monitor the current process
print(mem_usage)


[130.04296875]

Switch to the ~/temp folder for writing files. (Most of my PCs are backed up via Dropbox and SpiderOak, and I hate wasting bandwidth.)


In [4]:
%cd ~/temp


/home/mark/temp

clean up files from previous runs; overwriting takes longer than deleting and writing fresh files


In [5]:
%rm test*.*

Create some data

a typical LFP matrix: 32 channels, 1.5 million samples at 1 kHz (25 minutes of recording)


In [6]:
ADmat = np.random.randn(32, 1500000) / 100

In [7]:
whos ndarray


Variable   Type       Data/Info
-------------------------------
ADmat      ndarray    32x1500000: 48000000 elems, type `float64`, 384000000 bytes (366.2109375 Mb)
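
The reported size is easy to verify: 48 million float64 elements at 8 bytes each.

print(32 * 1500000 * 8)          # 384000000 bytes
print(32 * 1500000 * 8 / 2**20)  # 366.2109375, so the 'Mb' whos reports are mebibytes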

In [8]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[496.5546875]

save the matrix for analysis in R


In [9]:
np.save('test.npy', ADmat)

In [10]:
%ls -lstr test.npy


375012 -rw-r--r-- 1 mark mark 384000080 Aug  1 11:42 test.npy

In [11]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[496.60546875]

load into R using R magic

RcppCNPy is a fantastic library for R that lets you read and write numpy data files.


In [12]:
%%R
library(RcppCNPy)
setwd("~/temp")
ADmat = npyLoad('test.npy', type="numeric", dotranspose=FALSE)
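
RcppCNPy can write as well as read. A minimal sketch of the reverse trip (test_from_R.npy is a hypothetical file name, and I have not benchmarked the write path):

%%R
# write the R matrix back to a .npy file that numpy can read
npySave("test_from_R.npy", ADmat)

Back in Python, np.load('test_from_R.npy') reads it straight back.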

In [13]:
%R str(ADmat)


 num [1:32, 1:1500000] 0.017809 -0.006086 0.002423 -0.000249 -0.013145 ...

In [14]:
%R ls()


Out[14]:
array(['ADmat'], 
      dtype='<U5')

In [15]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[877.0546875]

remove ADmat to assess memory use with %Rpush below


In [16]:
%R rm(list=ls())

In [17]:
%R ls()


Out[17]:
array([], dtype=float64)

In [18]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[877.0546875]

ADmat is gone but memory is not released


In [19]:
%R gc(); # garbage collection
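
The trailing semicolon suppresses gc()'s return value; if you print it instead, you get R's own report of Ncells/Vcells usage, which helps separate R-side memory from the rest of the process:

%R print(gc())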

In [20]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[510.95703125]

now the memory is released



In [21]:
%Rpush ADmat

htop (and some calculations in bc) reports an extra 340 to 360 MB (variable across runs) following %Rpush... why?
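
One way to see how much of the increase is the R copy itself is R's object.size; the remainder presumably comes from temporary copies made during rpy2's conversion, though I have not confirmed that:

%R print(object.size(ADmat), units="Mb")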


In [22]:
%R str(ADmat)


 num [1:32, 1:1500000] 0.0178 0.0141 0.0032 0.004 -0.0101 ...

In [23]:
%R ls()


Out[23]:
array(['ADmat'], 
      dtype='<U5')

In [24]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[1244.09375]

%Rpush uses a lot more memory than saving the array with numpy and loading it into R with the RcppCNPy library!


In [25]:
%R rm(ADmat)

In [26]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[1244.1171875]

In [27]:
%R gc();

In [28]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[877.9140625]

now the memory consumed by ADmat is released, but the extra memory is still held


In [29]:
%R ls()


Out[29]:
array([], dtype=float64)

ls() returns an empty array now that ADmat is gone


In [30]:
%R rm(list=ls())

In [31]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[877.9140625]

In [32]:
%R gc();

In [33]:
mem_usage = memory_usage(-1, interval=1, timeout=1)
print(mem_usage)


[511.7109375]

now we are back to where we started!

Conclusions

  • Clear the R workspace of all variables (rm(list=ls())) and run gc() between loop iterations.
  • If your data take up a lot of memory, save the arrays to files with numpy and load them into R with the RcppCNPy library rather than pushing them with %Rpush.
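
Putting both points together, here is a minimal sketch of the per-file loop these conclusions suggest. The file names and the analysis script are hypothetical placeholders, and I call rpy2.robjects.r directly instead of %R so the file name can be interpolated:

import rpy2.robjects as ro

for fname in ['exp01.npy', 'exp02.npy', 'exp03.npy']:  # hypothetical per-experiment files
    # load the matrix on the R side straight from disk, avoiding the %Rpush copies
    ro.r('ADmat <- RcppCNPy::npyLoad("{}", dotranspose=FALSE)'.format(fname))
    ro.r('source("my_analysis.R")')         # hypothetical analysis script
    ro.r('rm(list=ls()); invisible(gc())')  # clear the R workspace and collect garbage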